Abstract

We consider multi-label prediction problems with large output spaces under the assumption of output sparsity 
- that the target (label) vectors have small support. We develop a general theory for a variant of the popular error 
correcting output code scheme, using ideas from compressed sensing for exploiting this sparsity. The method can be 
regarded as a simple reduction from multi-label regression problems to binary regression problems. We show that 
the number of subproblems need only be logarithmic in the total number of possible labels, making this approach 
radically more efficient than others. We also state and prove robustness guarantees for this method in the form of 
regret transform bounds (in general), and also provide a more detailed analysis for the linear prediction setting. 



1 Introduction 

Suppose we have a large database of images, and we want to learn to predict who or what is in any given one. A
standard approach to this task is to collect a sample of these images $x$ along with corresponding labels
$y = (y_1, \ldots, y_d) \in \{0,1\}^d$, where $y_i = 1$ if and only if person or object $i$ is depicted in image $x$,
and then feed the labeled sample to a multi-label learning algorithm. Here, $d$ is the total number of entities
depicted in the entire database. When $d$ is very large (e.g. $10^3$, $10^4$), the simple one-against-all approach of
learning a single predictor for each entity can become prohibitively expensive, both at training and testing time.

Our motivation for the present work comes from the observation that although the output (label) space may be very 
high dimensional, the actual labels are often sparse. That is, in each image, only a small number of entities may be 
present and there may only be a small amount of ambiguity in who or what they are. In this work, we consider how 
this sparsity in the output space, or output sparsity, eases the burden of large-scale multi-label learning. 

Exploiting output sparsity. A subtle but critical point that distinguishes output sparsity from more common notions
of sparsity (say, in feature or weight vectors) is that we are interested in the sparsity of $\mathbb{E}[y|x]$ rather
than $y$. In general, $\mathbb{E}[y|x]$ may be sparse while the actual outcome $y$ may not (e.g. if there is much
unbiased noise); and, vice versa, $y$ may be sparse with probability one but $\mathbb{E}[y|x]$ may have large support
(e.g. if there is little distinction between several labels).

Conventional linear algebra suggests that we must predict $d$ parameters in order to find the value of the
$d$-dimensional vector $\mathbb{E}[y|x]$ for each $x$. A crucial observation - central to the area of compressed
sensing [1] - is that methods exist to recover $\mathbb{E}[y|x]$ from just $O(k \log d)$ measurements when
$\mathbb{E}[y|x]$ is $k$-sparse. This is the basis of our approach.

Our contributions. We show how to apply algorithms for compressed sensing to the output coding approach [2]. At
a high level, the output coding approach creates a collection of subproblems of the form "Is the label in this subset or
its complement?", solves these problems, and then uses their solutions to predict the final label.

The role of compressed sensing in our application is distinct from its more conventional uses in data compression.
Although we do employ a sensing matrix to compress training data, we ultimately are not interested in recovering
data explicitly compressed this way. Rather, we learn to predict compressed label vectors, and then use sparse
reconstruction algorithms to recover uncompressed labels from these predictions. Thus we are interested in the
reconstruction accuracy of predictions, averaged over the data distribution.

The main contributions of this work are: 

1. A formal application of compressed sensing to prediction problems with output sparsity.

2. An efficient output coding method, in which the number of required predictions is only logarithmic in the 
number of labels d, making it applicable to very large-scale problems. 

3. Robustness guarantees, in the form of regret transform bounds (in general) and a further detailed analysis for 
the linear prediction setting. 

Prior work. The ubiquity of multi-label prediction problems in domains ranging from multiple object recognition in
computer vision to automatic keyword tagging for content databases has spurred the development of numerous general
methods for the task. Perhaps the most straightforward approach is the well-known one-against-all reduction [3], but
this can be too expensive when the number of possible labels is large (especially if applied to the power set of the
label space [4]). When structure can be imposed on the label space (e.g. class hierarchy), efficient learning and
prediction methods are often possible [5][6][7][8][9]. Here, we focus on a different type of structure, namely output
sparsity, which is not addressed in previous work. Moreover, our method is general enough to take advantage of
structured notions of sparsity (e.g. group sparsity) when available [10]. Recently, heuristics have been proposed for
discovering structure in large output spaces that empirically offer some degree of efficiency [11].

As previously mentioned, our work is most closely related to the class of output coding methods for multi-class
prediction, which was first introduced and shown to be useful experimentally in [2]. Relative to this work, we expand
the scope of the approach to multi-label prediction and provide bounds on regret and error which guide the design of
codes. The loss-based decoding approach [12] suggests decoding so as to minimize loss. However, it does not provide
significant guidance in the choice of encoding method, or the feedback between encoding and decoding which we
analyze here.

The output coding approach is inconsistent when classifiers are used and the underlying problems being encoded
are noisy. This is proved and analyzed in [13], where it is also shown that using a Hadamard code creates a robust
consistent predictor when reduced to binary regression. Compared to this method, our approach achieves the same
robustness guarantees up to a constant factor, but requires training and evaluating exponentially (in $d$) fewer
predictors.

Our algorithms rely on several methods from compressed sensing, which we detail where used. 

2 Preliminaries 

Let $\mathcal{X}$ be an arbitrary input space and $\mathcal{Y} \subset \mathbb{R}^d$ be a $d$-dimensional output
(label) space. We assume the data source is defined by a fixed but unknown distribution over
$\mathcal{X} \times \mathcal{Y}$. Our goal is to learn a predictor $F : \mathcal{X} \to \mathcal{Y}$ with low
expected $\ell_2^2$-error $\mathbb{E}_x \|F(x) - \mathbb{E}[y|x]\|_2^2$ (the sum of mean-squared-errors over all
labels) using a set of $n$ training data $\{(x_i, y_i)\}_{i=1}^n$.

We focus on the regime in which the output space is very high-dimensional ($d$ very large), but for any given
$x \in \mathcal{X}$, the expected value $\mathbb{E}[y|x]$ of the corresponding label $y \in \mathcal{Y}$ has only a
few non-zero entries. A vector is $k$-sparse if it has at most $k$ non-zero entries.

3 Learning and Prediction



3.1 Learning to Predict Compressed Labels 

Let $A : \mathbb{R}^d \to \mathbb{R}^m$ be a linear compression function, where $m \le d$ (but hopefully $m \ll d$).
We use $A$ to compress (i.e. reduce the dimension of) the labels $\mathcal{Y}$, and learn a predictor
$H : \mathcal{X} \to A(\mathcal{Y})$ of these compressed labels. Since $A$ is linear, we simply represent
$A \in \mathbb{R}^{m \times d}$ as a matrix.

Specifically, given a sample $\{(x_i, y_i)\}_{i=1}^n$, we form a compressed sample $\{(x_i, Ay_i)\}_{i=1}^n$ and then
learn a predictor $H$ of $\mathbb{E}[Ay|x]$ with the objective of minimizing the $\ell_2^2$-error
$\mathbb{E}_x \|H(x) - \mathbb{E}[Ay|x]\|_2^2$.
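The compression and training step can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the helper names `compress_labels` and `fit_least_squares`, the toy sizes, and the noiseless synthetic data are all assumptions made for the example, and ordinary (optionally ridge-regularized) least squares stands in for the generic regression learner.

```python
import numpy as np

def compress_labels(Y, A):
    """Map an (n, d) label matrix Y to the (n, m) compressed labels, row i being A @ Y[i]."""
    return Y @ A.T

def fit_least_squares(X, Z, ridge=0.0):
    """Fit one linear regressor per compressed coordinate (hypothetical helper).

    Returns W of shape (p, m) minimizing ||X W - Z||_F^2 + ridge * ||W||_F^2.
    """
    p = X.shape[1]
    if ridge > 0.0:
        return np.linalg.solve(X.T @ X + ridge * np.eye(p), X.T @ Z)
    return np.linalg.lstsq(X, Z, rcond=None)[0]

rng = np.random.default_rng(0)
n, p, d, m = 200, 5, 50, 10          # toy sizes, chosen only for illustration
X = rng.normal(size=(n, p))
B = rng.normal(size=(p, d))
Y = X @ B                            # noiseless linear labels, for the demo only
A = rng.normal(scale=1.0 / np.sqrt(m), size=(m, d))

Z = compress_labels(Y, A)            # compressed sample {(x_i, A y_i)}
W = fit_least_squares(X, Z)          # H(x) = W^T x predicts E[Ay | x]
```

Only $m$ regression problems are solved here instead of $d$; each column of `W` plays the role of one regressor $h_i$.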

3.2 Predicting Sparse Labels 

To obtain a predictor $F$ of $\mathbb{E}[y|x]$, we compose the predictor $H$ of $\mathbb{E}[Ay|x]$ (learned using the
compressed sample) with a reconstruction algorithm $R : \mathbb{R}^m \to \mathbb{R}^d$. The algorithm $R$ maps
predictions of compressed labels $h \in \mathbb{R}^m$ to predictions of labels $\hat{y} \in \mathcal{Y}$ in the
original output space. These algorithms typically aim to find a sparse vector $\hat{y}$ such that $A\hat{y}$ closely
approximates $h$.

Recent developments in the area of compressed sensing have produced a spate of reconstruction algorithms with 
strong performance guarantees when the compression function A satisfies certain properties. We abstract out the 
relevant aspects of these guarantees in the following definition. 

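Definition. An algorithm $R$ is a valid reconstruction algorithm for a family of compression functions
$(\mathcal{A}_k \subset \bigcup_{m \ge 1} \mathbb{R}^{m \times d} : k \in \mathbb{N})$ and sparsity error
$\mathrm{sperr} : \mathbb{N} \times \mathbb{R}^d \to \mathbb{R}$, if there exists a function
$f : \mathbb{N} \to \mathbb{N}$ and constants $C_1, C_2 \in \mathbb{R}$ such that: on input $k \in \mathbb{N}$,
$A \in \mathcal{A}_k$ with $m$ rows, and $h \in \mathbb{R}^m$, the algorithm $R(k, A, h)$ returns an $f(k)$-sparse
vector $\hat{y}$ satisfying

$$\|\hat{y} - y\|_2^2 \le C_1 \cdot \|h - Ay\|_2^2 + C_2 \cdot \mathrm{sperr}(k, y)$$

for all $y \in \mathbb{R}^d$. The function $f$ is the output sparsity of $R$ and the constants $C_1$ and $C_2$ are the
regret factors.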

Informally, if the predicted compressed label $H(x)$ is close to $\mathbb{E}[Ay|x] = A\mathbb{E}[y|x]$, then the sparse
vector $\hat{y}$ returned by the reconstruction algorithm should be close to $\mathbb{E}[y|x]$; this latter distance
$\|\hat{y} - \mathbb{E}[y|x]\|_2^2$ should degrade gracefully in terms of the accuracy of $H(x)$ and the sparsity of
$\mathbb{E}[y|x]$. Moreover, the algorithm should be agnostic about the sparsity of $\mathbb{E}[y|x]$ (and thus the
sparsity error $\mathrm{sperr}(k, \mathbb{E}[y|x])$), as well as the "measurement noise" (the prediction error
$\|H(x) - \mathbb{E}[Ay|x]\|_2$). This is a subtle condition and precludes certain reconstruction algorithms (e.g.
Basis Pursuit [14]) that require the user to supply a bound on the measurement noise. However, the condition is needed
in our application, as such bounds on the prediction error (for each $x$) are not generally known beforehand.

We make a few additional remarks on the definition. 

1. The minimum number of rows of matrices $A \in \mathcal{A}_k$ may in general depend on $k$ (as well as the ambient
dimension $d$). In the next section, we show how to construct such $A$ with close to the optimal number of rows.

2. The sparsity error $\mathrm{sperr}(k, y)$ should measure how poorly $y \in \mathbb{R}^d$ is approximated by a
$k$-sparse vector.

3. A reasonable output sparsity $f(k)$ for sparsity level $k$ should not be much more than $k$, e.g. $f(k) = O(k)$.

Concrete examples of valid reconstruction algorithms (along with the associated $\mathcal{A}_k$, sperr, etc.) are given
in the next section.



Algorithm 1 Training algorithm

parameters sparsity level $k$, compression function $A \in \mathcal{A}_k$ with $m$ rows, regression learning
    algorithm $L$
input training data $S \subset \mathcal{X} \times \mathbb{R}^d$
for $i = 1, \ldots, m$ do
    $h_i \leftarrow L(\{(x, (Ay)_i) : (x, y) \in S\})$
end for
output regressors $H = [h_1, \ldots, h_m]$

Algorithm 2 Prediction algorithm

parameters sparsity level $k$, compression function $A \in \mathcal{A}_k$ with $m$ rows, valid reconstruction
    algorithm $R$ for $\mathcal{A}_k$
input regressors $H = [h_1, \ldots, h_m]$, test point $x \in \mathcal{X}$
output $\hat{y} = R(k, A, [h_1(x), \ldots, h_m(x)])$

Figure 1: Training and prediction algorithms.



4 Algorithms 

Our prescribed recipe is summarized in Algorithms 1 and 2. We give some examples of compression functions and
reconstruction algorithms in the following subsections.



4.1 Compression Functions 

Several valid reconstruction algorithms are known for compression matrices that satisfy a restricted isometry property. 

Definition. A matrix $A \in \mathbb{R}^{m \times d}$ satisfies the $(k, \delta)$-restricted isometry property
($(k, \delta)$-RIP), $\delta \in (0, 1)$, if $(1 - \delta)\|x\|_2^2 \le \|Ax\|_2^2 \le (1 + \delta)\|x\|_2^2$ for all
$k$-sparse $x \in \mathbb{R}^d$.

While some explicit constructions of $(k, \delta)$-RIP matrices are known (e.g. [15]), the best guarantees are obtained
when the matrix is chosen randomly from an appropriate distribution, such as one of the following [16][17].

• All entries i.i.d. Gaussian $N(0, 1/m)$, with $m = O(k \log(d/k))$.

• All entries i.i.d. Bernoulli $B(1/2)$ over $\{\pm 1/\sqrt{m}\}$, with $m = O(k \log(d/k))$.

• $m$ randomly chosen rows of the $d \times d$ Hadamard matrix over $\{\pm 1/\sqrt{m}\}$, with $m = O(k \log^5 d)$.

The hidden constants in the big-O notation depend inversely on $\delta$ and the probability of success.

A striking feature of these constructions is the very mild dependence of m on the ambient dimension d. This translates 
to a significant savings in the number of learning problems one has to solve after employing our reduction. 
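The three random constructions listed above can be sketched as follows. The function names are hypothetical, the Hadamard matrix is built with the Sylvester recursion (so $d$ must be a power of 2), and sampling Hadamard rows without replacement is an assumption, since the text only says the rows are chosen randomly.

```python
import numpy as np

def gaussian_matrix(m, d, rng):
    """Entries i.i.d. N(0, 1/m)."""
    return rng.normal(scale=1.0 / np.sqrt(m), size=(m, d))

def bernoulli_matrix(m, d, rng):
    """Entries i.i.d. uniform over {+1/sqrt(m), -1/sqrt(m)}."""
    return rng.choice([-1.0, 1.0], size=(m, d)) / np.sqrt(m)

def hadamard(d):
    """Sylvester construction of the d x d Hadamard matrix (d a power of 2)."""
    if d < 1 or d & (d - 1):
        raise ValueError("d must be a power of 2")
    H = np.ones((1, 1))
    while H.shape[0] < d:
        H = np.block([[H, H], [H, -H]])
    return H

def subsampled_hadamard(m, d, rng):
    """m distinct rows of the d x d Hadamard matrix, scaled by 1/sqrt(m)."""
    rows = rng.choice(d, size=m, replace=False)   # assumption: no repeated rows
    return hadamard(d)[rows] / np.sqrt(m)

rng = np.random.default_rng(0)
m, d = 16, 64                                     # toy sizes for illustration
A_gauss = gaussian_matrix(m, d, rng)
A_bern = bernoulli_matrix(m, d, rng)
A_had = subsampled_hadamard(m, d, rng)
```

With the $1/\sqrt{m}$ scaling, every column of the Bernoulli and Hadamard-based matrices has unit Euclidean norm, and the Gaussian columns have unit norm in expectation.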

Some reconstruction algorithms require a stronger guarantee of bounded coherence $\mu(A) \le O(1/k)$, where $\mu(A)$
is defined as

$$\mu(A) = \max_{1 \le i < j \le d} \frac{|(A^\top A)_{i,j}|}{\sqrt{|(A^\top A)_{i,i}|\,|(A^\top A)_{j,j}|}}.$$

It is easy to check that the Gaussian, Bernoulli, and Hadamard-based random matrices given above have coherence
bounded by $O(\sqrt{(\log d)/m})$ with high probability. Thus, one can take $m = O(k^2 \log d)$ to guarantee $1/k$
coherence. This is a factor $k$ worse than what was needed for $(k, \delta)$-RIP, but the dependence on $d$ is still
small.
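As a quick illustration of the coherence definition, the sketch below computes $\mu(A)$ directly from the Gram matrix. The function name `coherence` and the toy Bernoulli matrix are assumptions for the example; the `reference` value is only the $\sqrt{(\log d)/m}$ scale with the hidden constant omitted, so it is not a bound that the computed value must satisfy.

```python
import numpy as np

def coherence(A):
    """Largest normalized off-diagonal entry of A^T A (columns must be non-zero)."""
    G = A.T @ A
    norms = np.sqrt(np.diag(G))
    C = np.abs(G) / np.outer(norms, norms)
    np.fill_diagonal(C, 0.0)          # ignore the i = j terms
    return float(C.max())

rng = np.random.default_rng(0)
m, d = 256, 512                       # toy sizes for illustration
A = rng.choice([-1.0, 1.0], size=(m, d)) / np.sqrt(m)
mu = coherence(A)
reference = np.sqrt(np.log(d) / m)    # O(sqrt((log d)/m)) scale, constant omitted
```

Because the definition normalizes by the column norms, rescaling $A$ leaves $\mu(A)$ unchanged.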

Algorithm 3 Prediction algorithm with R = OMP

parameters sparsity level $k$, compression function $A = [a_1 | \ldots | a_d] \in \mathcal{A}_k$ with $m$ rows
input regressors $H = [h_1, \ldots, h_m]$, test point $x \in \mathcal{X}$
$h \leftarrow [h_1(x), \ldots, h_m(x)]^\top$ (predict compressed label vector)
$\hat{y} \leftarrow 0$, $J \leftarrow \emptyset$, $r \leftarrow h$
for $i = 1, \ldots, 2k$ do
    $j_* \leftarrow \arg\max_j |r^\top a_j| / \|a_j\|_2$ (column of $A$ most correlated with residual $r$)
    $J \leftarrow J \cup \{j_*\}$ (add to set of selected columns)
    $\hat{y}_J \leftarrow (A_J)^\dagger h$, $\hat{y}_{J^c} \leftarrow 0$ (least-squares restricted to columns in $J$)
    $r \leftarrow h - A\hat{y}$ (update residual)
end for
output $\hat{y}$

Figure 2: Prediction algorithm specialized with Orthogonal Matching Pursuit.
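The reconstruction loop of the OMP-based prediction algorithm can be sketched in NumPy as below. It takes the predicted compressed label $h$ directly rather than the regressors, and two details are assumptions of this sketch rather than part of the pseudocode: it stops early once the residual is numerically zero, and it never re-selects a column that is already in the support.

```python
import numpy as np

def omp(A, h, k):
    """Run up to 2k steps of Orthogonal Matching Pursuit for h ~ A y.

    Returns a vector with at most 2k non-zero entries.
    """
    m, d = A.shape
    h = np.asarray(h, dtype=float)
    col_norms = np.linalg.norm(A, axis=0)
    y_hat = np.zeros(d)
    support = []                       # the set J of selected columns
    r = h.copy()                       # residual
    for _ in range(min(2 * k, d)):
        if np.linalg.norm(r) <= 1e-12:             # assumption: early exit on zero residual
            break
        scores = np.abs(A.T @ r) / col_norms       # correlation with the residual
        if support:
            scores[support] = -np.inf              # assumption: skip selected columns
        support.append(int(np.argmax(scores)))
        coef = np.linalg.lstsq(A[:, support], h, rcond=None)[0]
        y_hat = np.zeros(d)
        y_hat[support] = coef                      # least squares restricted to J
        r = h - A @ y_hat                          # update residual
    return y_hat

rng = np.random.default_rng(0)
m, d, k = 40, 100, 3                   # toy sizes for illustration
A = rng.normal(scale=1.0 / np.sqrt(m), size=(m, d))
y = np.zeros(d)
y[[5, 17, 60]] = [1.0, -2.0, 1.5]      # a 3-sparse label vector
h = A @ y                              # noiseless compressed label
y_hat = omp(A, h, k)
```

In the full prediction algorithm, `h` would be the vector of regressor outputs $[h_1(x), \ldots, h_m(x)]$ at a test point $x$.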

4.2 Reconstruction Algorithms 

In this section, we give some examples of valid reconstruction algorithms. Each of these algorithms is valid with
respect to the sparsity error given by

$$\mathrm{sperr}(k, y) = \|y - y_{(1:k)}\|_2^2 + \frac{1}{k}\|y - y_{(1:k)}\|_1^2$$

where $y_{(1:k)}$ is the best $k$-sparse approximation of $y$ (i.e. the vector with just the $k$ largest (in magnitude)
coefficients of $y$).
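For concreteness, the sparsity error can be evaluated as in the following sketch. The helper names `best_k_sparse` and `sperr` and the example vector are illustrative assumptions; ties between equal-magnitude coefficients are broken by position.

```python
import numpy as np

def best_k_sparse(y, k):
    """Keep the k largest-magnitude entries of y and zero out the rest."""
    y = np.asarray(y, dtype=float)
    out = np.zeros_like(y)
    if k > 0:
        keep = np.argsort(-np.abs(y), kind="stable")[:k]
        out[keep] = y[keep]
    return out

def sperr(k, y):
    """||y - y_(1:k)||_2^2 + (1/k) * ||y - y_(1:k)||_1^2, for k >= 1."""
    tail = np.asarray(y, dtype=float) - best_k_sparse(y, k)
    return float(np.sum(tail ** 2) + np.sum(np.abs(tail)) ** 2 / k)

y = np.array([3.0, -2.0, 1.0, 0.5])
value = sperr(2, y)   # tail is (0, 0, 1, 0.5): 1.25 + 2.25 / 2 = 2.375
```

The error vanishes exactly when $y$ is already $k$-sparse, which is the case in which a valid reconstruction algorithm pays only for the prediction error.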

The following theorem relates reconstruction quality to approximate sparse regression, giving a sufficient condition 
for any algorithm to be valid for RIP matrices. 

Theorem 1. Let $\mathcal{A}_k = \{(k + f(k), \delta)\text{-RIP matrices}\}$ for some function
$f : \mathbb{N} \to \mathbb{N}$, and let $A \in \mathcal{A}_k$ have $m$ rows. If for any $h \in \mathbb{R}^m$, a
reconstruction algorithm $R$ returns an $f(k)$-sparse solution $\hat{y} = R(k, A, h)$ satisfying

$$\|A\hat{y} - h\|_2^2 \le \inf_{y \in \mathbb{R}^d} C\,\|Ay_{(1:k)} - h\|_2^2,$$

then it is a valid reconstruction algorithm for $\mathcal{A}_k$ and sperr given above, with output sparsity $f$ and
regret factors $C_1 = 2(1 + \sqrt{C})^2/(1 - \delta)$ and $C_2 = 4(1 + (1 + \sqrt{C})/(1 - \delta))^2$.

Proofs are deferred to Section 6.

Iterative and greedy algorithms. Orthogonal Matching Pursuit (OMP) [18], FoBa [19], and CoSaMP [20] are
examples of iterative or greedy reconstruction algorithms. OMP is a greedy forward selection method that repeatedly
selects a new column of $A$ to use in fitting $h$ (see Algorithm 3). FoBa is similar, except it also incorporates
backward steps to un-select columns that are later discovered to be unnecessary. CoSaMP is also similar to OMP, but
instead selects larger sets of columns in each iteration.

FoBa and CoSaMP are valid reconstruction algorithms for RIP matrices ($(8k, 0.1)$-RIP and $(4k, 0.1)$-RIP,
respectively) and have linear output sparsity ($8k$ and $2k$). These guarantees are apparent from the cited
references. For OMP, we give the following guarantee.

Theorem 2. If $\mu(A) \le 0.1/k$, then after $f(k) = 2k$ steps of OMP, the algorithm returns $\hat{y}$ satisfying

$$\|A\hat{y} - h\|_2^2 \le 23\,\|Ay_{(1:k)} - h\|_2^2 \quad \forall y \in \mathbb{R}^d.$$

This theorem, combined with Theorem 1, implies that OMP is valid for matrices $A$ with $\mu(A) \le 0.1/k$ and has
output sparsity $f(k) = 2k$.

$\ell_1$ algorithms. Basis Pursuit (BP) [14] and its variants are based on finding the minimum $\ell_1$-norm solution
to a linear system. While the basic form of BP is ill-suited for our application (it requires the user to supply the
amount of measurement error $\|Ay - h\|_2$), its more advanced path-following or multi-stage variants may be valid
[21].



5 Analysis 

5.1 General Robustness Guarantees 

We now state our main regret transform bound, which follows immediately from the definition of a valid reconstruction 
algorithm and linearity of expectation. 

Theorem 3 (Regret Transform). Let $R$ be a valid reconstruction algorithm for $\{\mathcal{A}_k : k \in \mathbb{N}\}$
and $\mathrm{sperr} : \mathbb{N} \times \mathbb{R}^d \to \mathbb{R}$. Then there exist constants $C_1$ and $C_2$ such
that the following holds. Pick any $k \in \mathbb{N}$, $A \in \mathcal{A}_k$ with $m$ rows, and
$H : \mathcal{X} \to \mathbb{R}^m$. Let $F : \mathcal{X} \to \mathbb{R}^d$ be the composition of $R(k, A, \cdot)$ and
$H$, i.e. $F(x) = R(k, A, H(x))$. Then

$$\mathbb{E}_x \|F(x) - \mathbb{E}[y|x]\|_2^2 \le C_1 \cdot \mathbb{E}_x \|H(x) - \mathbb{E}[Ay|x]\|_2^2 + C_2 \cdot \mathrm{sperr}(k, \mathbb{E}[y|x]).$$

The simplicity of this theorem is a consequence of the careful composition of the learned predictors with the
reconstruction algorithm meeting the formal specifications described above.

In order to compare this regret bound with the bounds afforded by Sensitive Error Correcting Output Codes (SECOC)
[13], we need to relate $\mathbb{E}_x \|H(x) - \mathbb{E}[Ay|x]\|_2^2$ to the average scaled mean-squared-error over
all induced regression problems; the error is scaled by the maximum difference
$L_i = \max_{y \in \mathcal{Y}} (Ay)_i - \min_{y \in \mathcal{Y}} (Ay)_i$ between induced labels:

$$\bar{\epsilon} = \frac{1}{m} \sum_{i=1}^m \mathbb{E}_x \left[ \left( \frac{H(x)_i - \mathbb{E}[(Ay)_i|x]}{L_i} \right)^2 \right].$$

In $k$-sparse multi-label problems, we have $\mathcal{Y} = \{y \in \{0,1\}^d : \|y\|_0 \le k\}$. In these terms, SECOC
can be tuned to yield $\mathbb{E}_x \|F(x) - \mathbb{E}[y|x]\|_2^2 \le 4k^2 \cdot \bar{\epsilon}$ for general $k$.

For now, ignore the sparsity error. For simplicity, let $A \in \mathbb{R}^{m \times d}$ with entries chosen i.i.d.
from the Bernoulli $B(1/2)$ distribution over $\{\pm 1/\sqrt{m}\}$, where $m = O(k \log d)$. Then for any $k$-sparse
$y$, we have $\|Ay\|_\infty \le k/\sqrt{m}$, and thus $L_i \le 2k/\sqrt{m}$ for each $i$. This gives the bound

$$C_1 \cdot \mathbb{E}_x \|H(x) - \mathbb{E}[Ay|x]\|_2^2 \le 4C_1 \cdot k^2 \cdot \bar{\epsilon},$$

which is within a constant factor of the guarantee afforded by SECOC. Note that our reduction induces exponentially
(in $d$) fewer subproblems than SECOC.

Now we consider the sparsity error. In the extreme case $m = d$, $\mathbb{E}[y|x]$ is allowed to be fully dense
($k = d$) and $\mathrm{sperr}(k, \mathbb{E}[y|x]) = 0$. When $m = O(k \log d) < d$, we potentially incur an extra
penalty in $\mathrm{sperr}(k, \mathbb{E}[y|x])$, which relates how far $\mathbb{E}[y|x]$ is from being $k$-sparse. For
example, suppose $\mathbb{E}[y|x]$ has small $\ell_p$ norm for $0 \le p < 2$. Then even if $\mathbb{E}[y|x]$ has full
support, the penalty will decrease polynomially in $k \approx m/\log d$.

5.2 Linear Prediction

A danger of using generic reductions is that one might create a problem instance that is even harder to solve than the
original problem. This is an oft-cited issue with using output codes for multi-class problems. In the case of linear
prediction, however, the danger is mitigated, as we now show. Suppose, for instance, there is a perfect linear predictor
of $\mathbb{E}[y|x]$, i.e. $\mathbb{E}[y|x] = B^\top x$ for some $B \in \mathbb{R}^{p \times d}$ (here
$\mathcal{X} = \mathbb{R}^p$). Then it is easy to see that $H = BA^\top$ is a perfect linear predictor of
$\mathbb{E}[Ay|x]$:

$$H^\top x = AB^\top x = A\,\mathbb{E}[y|x] = \mathbb{E}[Ay|x].$$

The following theorem generalizes this observation to imperfect linear predictors for certain well-behaved $A$.

Theorem 4. Suppose $\mathcal{X} \subset \mathbb{R}^p$. Let $B \in \mathbb{R}^{p \times d}$ be a linear function with

$$\mathbb{E}_x \|B^\top x - \mathbb{E}[y|x]\|_2^2 = \epsilon.$$

Let $A \in \mathbb{R}^{m \times d}$ have entries drawn i.i.d. from $N(0, 1/m)$, and let $H = BA^\top$. Then with high
probability (over the choice of $A$),

$$\mathbb{E}_x \|H^\top x - A\,\mathbb{E}[y|x]\|_2^2 \le (1 + O(1/\sqrt{m}))\,\epsilon.$$

Remark 5. Similar guarantees can be proven for the Bernoulli-based matrices. Note that $d$ does not appear in the
bound, which is in contrast to the expected spectral norm of $A$: roughly $1 + O(\sqrt{d/m})$.

Theorem 4 implies that the errors of any linear predictor are not magnified much by the compression function. So
a good linear predictor for the original problem implies an almost-as-good linear predictor for the induced problem.
Using this theorem together with known results about linear prediction [22], it is straightforward to derive sample
complexity bounds for achieving a given error relative to that of the best linear predictor in some class. The bound
will depend polynomially on $k$ but only logarithmically on $d$. This is cosmetically similar to learning bounds for
feature-efficient algorithms (e.g. [23][22]) which are concerned with sparsity in the weight vector, rather than in
the output.
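A small numerical check can make the linear-prediction argument concrete. The sketch below uses synthetic data (all sizes, the noise model, and the variable names are assumptions for illustration): it verifies the identity $H^\top x = AB^\top x$ for $H = BA^\top$ and compares the error of $H$ on the compressed problem with the error $\epsilon$ of $B$ on the original problem for one Gaussian draw of $A$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, d, m = 300, 8, 500, 100              # toy sizes for illustration
X = rng.normal(size=(n, p))                # rows are inputs x
B = rng.normal(size=(p, d))                # an imperfect linear predictor
target = X @ B + rng.normal(size=(n, d))   # synthetic stand-in for E[y|x]

A = rng.normal(scale=1.0 / np.sqrt(m), size=(m, d))   # entries N(0, 1/m)
H = B @ A.T                                # induced predictor, shape (p, m)

V = X @ B - target                         # rows are v_x = B^T x - E[y|x]
eps = np.mean(np.sum(V ** 2, axis=1))      # error of B on the original problem
eps_compressed = np.mean(np.sum((X @ H - target @ A.T) ** 2, axis=1))
ratio = eps_compressed / eps               # should be close to 1 for this draw
```

The averages over the $n$ synthetic points play the role of the expectation over $x$; a ratio near 1 is what the $(1 + O(1/\sqrt{m}))$ factor suggests, though a single random draw is of course only suggestive.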



6 Proofs 

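6.1 Proof of Theorem 1

Let $\ell = k + f(k)$, $y \in \mathbb{R}^d$, and assume without loss of generality that
$|y_1| \ge \cdots \ge |y_d|$. We need to show that

$$\|\hat{y} - y\|_2^2 \le C_1 \cdot \|Ay - h\|_2^2 + C_2 \cdot (\|\Delta\|_2^2 + k^{-1}\|\Delta\|_1^2)$$

where $\Delta = y - y_{(1:k)}$. Using the triangle inequality, the $(\ell, \delta)$-RIP of $A \in \mathcal{A}_k$, and
the hypothesis that $\|A\hat{y} - h\|_2^2 \le C\,\|Ay_{(1:k)} - h\|_2^2$, we have

$$\|\hat{y} - y\|_2 \le \|\hat{y} - y_{(1:k)}\|_2 + \|\Delta\|_2$$
$$\le (1 - \delta)^{-1/2}\,\|A\hat{y} - Ay_{(1:k)}\|_2 + \|\Delta\|_2$$
$$\le (1 - \delta)^{-1/2}\,(\|A\hat{y} - h\|_2 + \|h - Ay_{(1:k)}\|_2) + \|\Delta\|_2$$
$$\le (1 - \delta)^{-1/2}(1 + \sqrt{C})\,\|Ay_{(1:k)} - h\|_2 + \|\Delta\|_2$$
$$\le (1 - \delta)^{-1/2}(1 + \sqrt{C})\,(\|Ay - h\|_2 + \|A\Delta\|_2) + \|\Delta\|_2. \quad (1)$$

We need to relate $\|A\Delta\|_2$ to $\|\Delta\|_2$ and $\|\Delta\|_1$. Write $\Delta = \sum_{i \ge 0} y_{J_i}$, where
$J_i = \{k + i\ell + 1, \ldots, k + (i+1)\ell\}$ and $y_J \in \mathbb{R}^d$ is the vector whose $j$th component is
$y_j$ if $j \in J$ and is 0 otherwise. Note that each $y_{J_i}$ is $\ell$-sparse,
$\|y_{J_{i+1}}\|_1 \le \|y_{J_i}\|_1$, and $\|y_{J_{i+1}}\|_\infty \le \ell^{-1}\|y_{J_i}\|_1$. By Hölder's inequality,

$$\|y_{J_{i+1}}\|_2 \le (\|y_{J_{i+1}}\|_\infty \|y_{J_{i+1}}\|_1)^{1/2} \le (\ell^{-1}\|y_{J_i}\|_1^2)^{1/2} = \ell^{-1/2}\|y_{J_i}\|_1,$$

and so

$$\sum_{i \ge 0} \|y_{J_i}\|_2 \le \|y_{J_0}\|_2 + \sum_{i \ge 0} \|y_{J_{i+1}}\|_2 \le \|y_{J_0}\|_2 + \ell^{-1/2} \sum_{i \ge 0} \|y_{J_i}\|_1 \le \|\Delta\|_2 + \ell^{-1/2}\|\Delta\|_1.$$

By the triangle inequality, the $(\ell, \delta)$-RIP of $A$, and the fact that $\ell \ge k$, we have

$$\|A\Delta\|_2 \le \sum_{i \ge 0} \|Ay_{J_i}\|_2 \le \sum_{i \ge 0} (1 + \delta)^{1/2}\|y_{J_i}\|_2 \le (1 + \delta)^{1/2}(\|\Delta\|_2 + k^{-1/2}\|\Delta\|_1).$$

Combining this final inequality with (1) gives

$$\|\hat{y} - y\|_2 \le C_0 \cdot \|Ay - h\|_2 + (1 + C_0(1 + \delta)^{1/2}) \cdot (\|\Delta\|_2 + k^{-1/2}\|\Delta\|_1)$$

where $C_0 = (1 - \delta)^{-1/2}(1 + \sqrt{C})$. Now squaring both sides and simplifying using the fact
$(x + y)^2 \le 2x^2 + 2y^2$ concludes the proof.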

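6.2 Proof of Theorem 2

We first begin with two simple lemmas.

Lemma 6. Suppose OMP is run for $k$ iterations starting with $\hat{y}^{(0)} = 0$, and produces intermediate solutions
$\hat{y}^{(1)}, \hat{y}^{(2)}, \ldots, \hat{y}^{(k)}$. Then there exists some $0 \le i < k$ such that if $j_i$ is the
column selected in step $i$, then $(a_{j_i}^\top (h - A\hat{y}^{(i)}))^2 \le \|h\|_2^2/k$.

Proof. Let $r^{(i)} = h - A\hat{y}^{(i)}$. Suppose column $j_i$ is added to $J$ in step $i$. Let
$\tilde{y}^{(i+1)} = \hat{y}^{(i)} + \alpha_i e_{j_i}$, where $\alpha_i = a_{j_i}^\top r^{(i)}$ and $e_{j_i}$ is the
$j_i$th elementary vector. Then, for unit-norm columns,

$$\|r^{(i)}\|_2^2 - \|r^{(i+1)}\|_2^2 \ge \|r^{(i)}\|_2^2 - \|h - A\tilde{y}^{(i+1)}\|_2^2 = \|r^{(i)}\|_2^2 - \|h - A(\hat{y}^{(i)} + \alpha_i e_{j_i})\|_2^2$$
$$= \|r^{(i)}\|_2^2 - \|r^{(i)} - \alpha_i a_{j_i}\|_2^2$$
$$= 2\alpha_i a_{j_i}^\top r^{(i)} - \alpha_i^2 \|a_{j_i}\|_2^2 = (a_{j_i}^\top r^{(i)})^2.$$

Moreover, $\sum_{i=0}^{k-1} (\|r^{(i)}\|_2^2 - \|r^{(i+1)}\|_2^2) = \|r^{(0)}\|_2^2 - \|r^{(k)}\|_2^2 \le \|h\|_2^2$, so
there is some $i \in \{0, 1, \ldots, k-1\}$ such that

$$(a_{j_i}^\top r^{(i)})^2 \le \|r^{(i)}\|_2^2 - \|r^{(i+1)}\|_2^2 \le \|h\|_2^2/k.$$

□

Lemma 7. If $y \in \mathbb{R}^d$ is $k$-sparse and $\mu(A) \le \delta/(k - 1)$, then
$\|Ay\|_2^2 \ge (1 - \delta)\|y\|_2^2$.

This result also appears in Appendix A1 of [24]. We reproduce the proof here.

Proof. Expanding $\|Ay\|_2^2$ (with the columns of $A$ normalized and the support of $y$ contained in
$\{1, \ldots, k\}$), we have

$$\|Ay\|_2^2 = \sum_{i=1}^k y_i^2 \|a_i\|_2^2 + \sum_{i \ne j} y_i y_j\, a_i^\top a_j = \|y\|_2^2 + \sum_{i \ne j} y_i y_j\, a_i^\top a_j,$$

so we need to show this latter summation is at most $\delta\|y\|_2^2$ in absolute value. Indeed,

$$\Big|\sum_{i \ne j} y_i y_j\, a_i^\top a_j\Big| \le \sum_{i \ne j} |y_i y_j|\,|a_i^\top a_j| \quad \text{(triangle inequality)}$$
$$\le \mu(A) \sum_{i \ne j} |y_i y_j| \quad \text{(definition of coherence)}$$
$$= \mu(A) \Big(\sum_{i=1}^k \sum_{j=1}^k |y_i y_j| - \sum_{i=1}^k y_i^2\Big)$$
$$= \mu(A)\,(\|y\|_1^2 - \|y\|_2^2)$$
$$\le \mu(A)\,(k\|y\|_2^2 - \|y\|_2^2) \quad \text{(Cauchy-Schwarz)}$$
$$= \mu(A)(k - 1)\|y\|_2^2$$
$$\le \delta\|y\|_2^2 \quad \text{(assumption on } \mu(A)\text{)}$$

which concludes the proof.

□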

We are now ready to prove Theorem 2. Without loss of generality, we assume that the columns of
$A = [a_1 | \ldots | a_d]$ are normalized (so $\|a_i\|_2 = 1$) and that the support of $y$ is (some subset of)
$\{1, \ldots, k\}$ (so $y$ is $k$-sparse).

In addition to the vector $\hat{y}$ returned by OMP and the vector $y$ we want to compare to, we consider two other
solution vectors:

• $y'$: a $(2k - 1)$-sparse solution obtained by running up to $k - 1$ iterations of OMP starting from $y$. Lemma 6
implies that there exists such a vector $y'$ with the following property: if $j_*$ is the column OMP would select
when the current solution is $y'$, then

$$(a_{j_*}^\top (h - Ay'))^2 \le \|h - Ay\|_2^2/k. \quad (2)$$

Since $y'$ is obtained by starting with $y$, it can only have smaller squared-error than $y$. Without loss of
generality, let the support of $y'$ be (some subset of) $\{1, \ldots, 2k\}$.

• $\hat{y}'$: the actual solution produced by OMP (starting from 0) just before OMP chooses a column
$j \notin \mathrm{supp}(y')$. Note that if OMP never chooses a column $j \notin \mathrm{supp}(y')$ within $2k$ steps,
then $\|A\hat{y} - h\|_2^2 \le \|Ay' - h\|_2^2 \le \|Ay - h\|_2^2$ and the theorem is proven. Therefore we assume that
this event does occur and so $\hat{y}'$ is defined. Since $\hat{y}'$ precedes the final solution $\hat{y}$ returned by
OMP, it can only have larger squared-error than $\hat{y}$.

We will bound $\|h - A\hat{y}\|_2$ as follows:

$$\|h - A\hat{y}\|_2 \le \|h - A\hat{y}'\|_2 \quad \text{(since } \hat{y}' \text{ precedes } \hat{y}\text{)}$$
$$\le \|h - Ay'\|_2 + \|A(\hat{y}' - y')\|_2 \quad \text{(triangle inequality)}$$
$$\le \|h - Ay\|_2 + \|A(\hat{y}' - y')\|_2. \quad \text{(since } y \text{ precedes } y'\text{)}$$

We thus need to bound $\|A(\hat{y}' - y')\|_2$ in terms of $\|h - Ay\|_2$.
Let $r = h - Ay'$ and $\hat{r} = h - A\hat{y}'$. Then

$$\|A(\hat{y}' - y')\|_2^2 = (A\hat{y}' - Ay')^\top A(\hat{y}' - y')$$
$$= (h - Ay')^\top A(\hat{y}' - y') - (h - A\hat{y}')^\top A(\hat{y}' - y')$$
$$\le \|h - Ay'\|_2\,\|A(\hat{y}' - y')\|_2 + |(h - A\hat{y}')^\top A(\hat{y}' - y')| \quad \text{(Cauchy-Schwarz)}$$
$$= \|r\|_2\,\|A(\hat{y}' - y')\|_2 + |\hat{r}^\top A(\hat{y}' - y')|.$$

Using the fact $x \le b\sqrt{x} + c \Rightarrow x \le (4/3)(b^2 + c)$ (which in turn follows from the quadratic formula
and the fact $2xy \le x^2 + y^2$), the above inequality implies

$$\tfrac{3}{4}\|A(\hat{y}' - y')\|_2^2 \le \|r\|_2^2 + |\hat{r}^\top A(\hat{y}' - y')|. \quad (3)$$

We now work on bounding the second term on the right-hand side. Let $j > 2k$ be the column chosen by OMP when
the current solution is $\hat{y}'$. Then we have

$$|a_j^\top \hat{r}| \ge |a_\ell^\top \hat{r}| \quad \forall \ell \le 2k. \quad (4)$$

Also, since $\hat{y}' - y'$ has support $\{1, \ldots, 2k\}$, we have that

$$A(\hat{y}' - y') = A_{\{1:2k\}}(\hat{y}' - y') \quad (5)$$

where $A_{\{1:2k\}}$ is the same as $A$ except with zeros in all but the first $2k$ columns. Then,

$$|\hat{r}^\top A(\hat{y}' - y')| = |\hat{r}^\top A_{\{1:2k\}}(\hat{y}' - y')| \quad \text{(Equation (5))}$$
$$\le \|\hat{r}^\top A_{\{1:2k\}}\|_\infty \|\hat{y}' - y'\|_1 \quad \text{(Hölder's inequality)}$$
$$\le |a_j^\top \hat{r}|\,\|\hat{y}' - y'\|_1 \quad \text{(Inequality (4))}$$
$$\le (|a_j^\top r| + |a_j^\top A(\hat{y}' - y')|)\,\|\hat{y}' - y'\|_1 \quad \text{(triangle inequality)}$$
$$\le (|a_j^\top r| + \|a_j^\top A_{\{1:2k\}}\|_\infty \|\hat{y}' - y'\|_1)\,\|\hat{y}' - y'\|_1 \quad \text{(Equation (5) and Hölder)}$$
$$\le |a_j^\top r|\,\|\hat{y}' - y'\|_1 + \mu(A)\|\hat{y}' - y'\|_1^2 \quad \text{(definition of coherence)}$$
$$\le \tfrac{5k}{2}(a_j^\top r)^2 + \tfrac{1}{10k}\|\hat{y}' - y'\|_1^2 + \mu(A)\|\hat{y}' - y'\|_1^2 \quad \text{(since } xy \le (x^2 + y^2)/2\text{)}$$
$$\le \tfrac{5k}{2}(a_j^\top r)^2 + \tfrac{1}{5k}\|\hat{y}' - y'\|_1^2 \quad \text{(since } \mu(A) \le 0.1/k\text{)}$$
$$\le \tfrac{5k}{2}(a_j^\top r)^2 + \tfrac{2}{5}\|\hat{y}' - y'\|_2^2 \quad \text{(Cauchy-Schwarz)}$$
$$\le \tfrac{5k}{2}(a_j^\top r)^2 + \tfrac{1}{2}\|A(\hat{y}' - y')\|_2^2. \quad \text{(Lemma 7)}$$

Continuing from Inequality (3), we have

$$\|A(\hat{y}' - y')\|_2^2 \le 4\|r\|_2^2 + 10k\,(a_j^\top r)^2.$$

Since $(a_j^\top r)^2 \le (a_{j_*}^\top r)^2$, where $j_* \le 2k$ is the column that OMP would select when the current
solution is $y'$, and since $(a_{j_*}^\top r)^2 \le \|h - Ay\|_2^2/k$ (by Inequality (2)), we have that

$$\|A(\hat{y}' - y')\|_2^2 \le 4\|r\|_2^2 + 10\|h - Ay\|_2^2 \le 14\|h - Ay\|_2^2.$$

Therefore,

$$\|h - A\hat{y}\|_2 \le (1 + \sqrt{14})\,\|h - Ay\|_2.$$

Squaring both sides gives the conclusion.



6.3 Proof of Theorem 4

We use the following Chernoff bound for sums of $\chi^2$ random variables, a proof of which can be found in Appendix
A of [25].

Lemma 8. Fix any $\lambda_1 \ge \ldots \ge \lambda_D > 0$, and let $X_1, \ldots, X_D$ be i.i.d. $\chi^2$ random
variables with one degree of freedom. Then
$\Pr[\sum_{i=1}^D \lambda_i X_i > (1 + \gamma) \sum_{i=1}^D \lambda_i] \le \exp(-(D\gamma^2/24) \cdot (\bar{\lambda}/\lambda_1))$
for any $0 < \gamma < 1$, where $\bar{\lambda} = (\lambda_1 + \ldots + \lambda_D)/D$.

Write $A = (1/\sqrt{m})[\theta_1 | \cdots | \theta_m]^\top$, where each $\theta_i$ is an independent $d$-dimensional
Gaussian random vector $N(0, I_d)$. Define $v_x = B^\top x - \mathbb{E}[y|x]$ so
$\epsilon = \mathbb{E}_x \|v_x\|_2^2$, and assume without loss of generality that $v_x$ has full $d$-dimensional
support. Using this definition and linearity of expectation, we have

$$\mathbb{E}_x \|Av_x\|_2^2 = \frac{1}{m}\,\mathbb{E}_x \sum_{i=1}^m (\theta_i^\top v_x)^2 = \frac{1}{m} \sum_{i=1}^m \theta_i^\top (\mathbb{E}_x v_x v_x^\top)\,\theta_i.$$

Our goal is to show that this quantity is $(1 + O(1/\sqrt{m}))\,\epsilon$ with high probability. Since $N(0, I_d)$ is
rotationally invariant and $\mathbb{E}_x v_x v_x^\top$ is symmetric and positive definite, we may assume
$\mathbb{E}_x v_x v_x^\top$ is diagonal and has eigenvalues $\lambda_1 \ge \ldots \ge \lambda_d > 0$. Then, the above
expression simplifies to

$$\frac{1}{m} \sum_{i=1}^m \theta_i^\top (\mathbb{E}_x v_x v_x^\top)\,\theta_i = \frac{1}{m} \sum_{i=1}^m \sum_{j=1}^d \lambda_j \theta_{ij}^2.$$

Each $\theta_{ij}^2$ is a $\chi^2$ random variable with one degree of freedom, so $\mathbb{E}\,\theta_{ij}^2 = 1$.
Thus, the expected value of the above quantity is
$\sum_{j=1}^d \lambda_j = \mathrm{trace}(\mathbb{E}_x v_x v_x^\top) = \mathbb{E}_x \mathrm{trace}(v_x v_x^\top) = \mathbb{E}_x \|v_x\|_2^2 = \epsilon$.
Now applying Lemma 8 with $D = md$ variables and $\bar{\lambda} = (\lambda_1 + \ldots + \lambda_d)/d$, we have
$\Pr[(1/m) \sum_{i,j} \lambda_j \theta_{ij}^2 > (1 + t)\,\epsilon] \le \exp(-(mdt^2/24)(\bar{\lambda}/\lambda_1)) \le \exp(-mt^2/24)$
(using the fact $\lambda_1 \le d\bar{\lambda}$). This bound is $\delta$ when $t = \sqrt{(24/m)\ln(1/\delta)}$.



7 Experimental Validation 

We conducted an empirical assessment of our proposed reduction on two labeled data sets with large label spaces. 
These experiments demonstrate the feasibility of our method - a sanity check that the reduction does in fact preserve 
learnability - and compare different compression and reconstruction options. 



7.1 Data 

Image data. The first data set was collected by the ESP Game [26], an online game in which players ultimately
provide word tags for a diverse set of web images.

The set contains nearly 68000 images, with about 22000 unique labels. We retained just the 1000 most frequent labels: 
the least frequent of these occurs 39 times in the data, and the most frequent occurs about 12000 times. Each image 
contains about four labels on average. We used half of the data for training and half for testing. 

We represented each image as a bag-of-features vector in a manner similar to [27]. Specifically, we identified 1024
representative SURF feature points [28] from 10 x 10 gray-scale patches chosen randomly from the training images;
this partitions the space of image patches (represented with SURF features) into Voronoi cells. We then built a
histogram for each image, counting the number of patches that fall in each cell.

Text data. The second data set was collected by Tsoumakas et al. [11] from del.icio.us, a social bookmarking
service in which users assign descriptive textual tags to web pages.

The set contains about 16000 labeled web pages and 983 unique labels. The least frequent label occurs 21 times and
the most frequent occurs almost 6500 times. Each web page is assigned 19 labels on average. Again, we used half the 
data for training and half for testing. 

Each web page is represented as a boolean bag-of-words vector, with the vocabulary chosen using a combination of
frequency thresholding and $\chi^2$ feature ranking. See [11] for details.

Each binary label vector (in both data sets) indicates the labels of the corresponding data point. 



7.2 Output Sparsity 

We first performed a bit of exploratory data analysis to get a sense of how sparse the target in our data is. We
computed the least-squares linear regressor $B \in \mathbb{R}^{p \times d}$ on the training data (without any output
coding) and predicted the label probabilities $p(x) = B^\top x$ on the test data (clipping values to the range
$[0, 1]$). Using $p(x)$ as a surrogate for the actual target $\mathbb{E}[y|x]$, we examined the relative $\ell_2^2$
error of $p$ and its best $k$-sparse approximation

$$\epsilon(k, p(x)) = \sum_{i=k+1}^d p_{(i)}(x)^2 \,/\, \|p(x)\|_2^2,$$

where $p_{(1)}(x) \ge \cdots \ge p_{(d)}(x)$.

Footnotes:
Image data: http://hunch.net/~learning/ESP-ImageSet.tar.gz
Text data: http://mlkd.csd.auth.gr/multilabel.html

Examining $\mathbb{E}_x\,\epsilon(k, p(x))$ as a function of $k$, we saw that in both the image and text data, the
fall-off with $k$ is eventually super-polynomial, but we are interested in the behavior for small $k$, where it appears
polynomial, $k^{-r}$ for some $r$. Around $k = 10$, we estimated an exponent of 0.50 for the image data and 0.55 for
the text data. This is somewhat below the standard of what is considered sparse (e.g. vectors with small
$\ell_1$-norm show $k^{-1}$ decay). Thus, we expect the reconstruction algorithms will have to contend with the
sparsity error of the target.
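The sparsity diagnostic described in this subsection can be sketched as follows. The function names are hypothetical, and the log-log least-squares fit in `decay_exponent` is only one plausible way to estimate the exponent $r$; the text does not say how the reported exponents were obtained.

```python
import numpy as np

def relative_tail_error(p, k):
    """Sum of squares of all but the k largest entries of p, divided by ||p||_2^2."""
    p = np.asarray(p, dtype=float)
    sq = np.sort(p ** 2)[::-1]        # squared entries, largest first
    return float(sq[k:].sum() / sq.sum())

def decay_exponent(errors, ks):
    """Slope r of a least-squares fit log(error) ~ -r * log(k) (assumed estimator)."""
    slope, _ = np.polyfit(np.log(ks), np.log(errors), 1)
    return float(-slope)

p = np.array([0.9, 0.5, 0.1, 0.0])    # toy clipped prediction vector
err1 = relative_tail_error(p, 1)      # (0.25 + 0.01) / (0.81 + 0.25 + 0.01)

ks = np.arange(1, 11, dtype=float)
r = decay_exponent(ks ** -0.5, ks)    # synthetic k^{-0.5} decay
```

In practice the errors passed to `decay_exponent` would be the test-set averages of `relative_tail_error` over a small range of $k$.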

7.3 Procedure 

We used least-squares linear regression as our base learning algorithm, with no regularization on the image data and
with $\ell_2$-regularization with the text data ($\lambda = 0.01$) for numerical stability. We did not attempt any
parameter tuning.

The compression functions we used were generated by selecting $m$ random rows of the 1024 x 1024 Hadamard matrix,
for $m \in \{100, 200, 300, 400\}$. We also experimented with Gaussian matrices; these yielded similar but uniformly
worse results.

We tested the greedy and iterative reconstruction algorithms described earlier (OMP, FoBa, and CoSaMP) as well as
a path-following version of Lasso based on LARS [21]. Each algorithm was used to recover a $k$-sparse label vector
$\hat{y}^k$ from the predicted compressed label $H(x)$, for $k = 1, \ldots, 10$. We measured the $\ell_2^2$ distance
$\|\hat{y}^k - y\|_2^2$ of the prediction to the true test label $y$. In addition, we measured the precision of the
predicted support at various values of $k$ using the 10-sparse label prediction. That is, we ordered the coefficients
of each 10-sparse label prediction $\hat{y}^{10}$ by magnitude, and measured the precision of predicting the first $k$
coordinates $|\mathrm{supp}(\hat{y}^{10}_{(1:k)}) \cap \mathrm{supp}(y)|/k$. Actually, for $k \ge 6$, we used
$\hat{y}^{2k}$ instead of $\hat{y}^{10}$.

We used correlation decoding (CD) as a baseline method, as it is a standard decoding method for ECOC approaches.
CD predicts using the top $k$ coordinates in $A^\top H(x)$, ordered by magnitude. For mean-squared-error comparisons,
we used the least-squares approximation of $H(x)$ using these $k$ columns of $A$. Note that CD is not a valid
reconstruction algorithm when $m < d$.
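The correlation-decoding baseline and the precision measure can be sketched as below. The function names and the tiny identity-matrix example are assumptions for illustration; `precision_at_k` counts only predicted coordinates that are actually non-zero, matching the support-based definition used in the procedure.

```python
import numpy as np

def correlation_decode(A, h, k):
    """Indices of the k largest-magnitude coordinates of A^T h, largest first."""
    scores = np.abs(A.T @ h)
    return np.argsort(-scores, kind="stable")[:k]

def cd_least_squares(A, h, k):
    """Least-squares fit of h using only the k columns chosen by correlation decoding."""
    idx = correlation_decode(A, h, k)
    y_hat = np.zeros(A.shape[1])
    y_hat[idx] = np.linalg.lstsq(A[:, idx], h, rcond=None)[0]
    return y_hat

def precision_at_k(y_pred, y_true, k):
    """|supp(top-k of y_pred) & supp(y_true)| / k."""
    top = np.argsort(-np.abs(y_pred), kind="stable")[:k]
    top = top[y_pred[top] != 0]       # keep only coordinates in the predicted support
    return float(np.count_nonzero(y_true[top]) / k)

A = np.eye(4)                         # trivial "compression" used only for the demo
h = np.array([0.1, 0.9, 0.0, 0.7])    # predicted compressed label
y_true = np.array([0, 1, 0, 0])       # true binary label vector

idx = correlation_decode(A, h, 2)     # coordinates 1 and 3
y_cd = cd_least_squares(A, h, 2)
prec = precision_at_k(y_cd, y_true, 2)
```

Unlike OMP, correlation decoding never updates a residual, so correlated columns of $A$ can all be selected at once.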

7.4 Results 

As expected, the performance of the reduction, using any reconstruction algorithm, improves as the number of induced
subproblems $m$ is increased (see Figures 3 and 4; at $m = 300, 400$, the precision-at-$k$ is nearly the same as
one-against-all, i.e. $m = 1024$). When $m$ is small and $A \notin \mathcal{A}_K$, the reconstruction algorithm cannot
reliably choose $k \ge K$ coordinates, so its performance may degrade after this point by over-fitting. But when the
compression function $A$ is in $\mathcal{A}_K$ for a sufficiently large $K$, then the squared-error decreases as the
output sparsity $k$ increases up to $K$. Note the fact that precision-at-$k$ decreases as $k$ increases is expected,
as fewer data will have at least $k$ correct labels.

All of the reconstruction algorithms matched or outperformed the baseline on the mean-squared-error criterion,
except when $m = 100$. When $A$ has few rows, (1) $A \in \mathcal{A}_K$ only for very small $K$, and (2) many of its
columns will have significant correlation. In this case, when choosing $k > K$ columns, it is better to choose
correlated columns to avoid over-fitting. Both OMP and FoBa explicitly avoid this and thus do not fare well; but
CoSaMP, Lasso, and CD do allow selecting correlated columns and thus perform better in this regime.

The results for precision-at-$k$ are similar to those for mean-squared-error, except that choosing correlated columns
does not necessarily help in the small $m$ regime. This is because the extra correlated columns need not correspond to
accurate label coordinates.

In summary, the experiments demonstrate the feasibility and robustness of our reduction method for two natural
multi-label prediction tasks. They show that predictions of relatively few compressed labels are sufficient to recover
an accurate sparse label vector, and as our theory suggests, the robustness of the reconstruction algorithms is a key
factor in their success.

Figure 3: Mean-squared-error versus output sparsity $k$, $m \in \{100, 200\}$. Top: image data. Bottom: text data. In
each plot: the top set of lines corresponds to $m = 100$, and the bottom set to $m = 200$.

Figure 4: Mean precision-at-$k$ versus output sparsity $k$, $m \in \{100, 200\}$. Top: image data. Bottom: text data.
In each plot: the top black unadorned line is one-against-all ($m = 1024$), the middle set of lines corresponds to
$m = 200$, and the bottom set to $m = 100$.